We will do a clustering analysis on the iris data set, using the k-means technique and a hierarchical clustering approach

The iris data set is one of R’s built-in data-sets, and is also available at the UCI Machine Learning Repository.

The data set was created by statistician Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.

The data consists of 3 classes of flower types:

The data has 4 attributes:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

The built-in data set iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

library(tidyverse)

Let us first create a plot using the variables Petal.Width and Petal.Width, coloring every point by their class:

ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
  geom_point()

Using k-means

Let us try to find the 3 groups using k-means

# use kmeans() with the subset of columns we want to consider
iris_mini <- iris[, c("Petal.Width", "Petal.Length")]
km_iris <- kmeans(x = iris_mini, centers = 3)

The object km_iris contains an attribute called cluster that we want to extract to see the different assignments:

str(km_iris)
List of 9
 $ cluster     : int [1:150] 3 3 3 3 3 3 3 3 3 3 ...
 $ centers     : num [1:3, 1:2] 1.342 2.037 0.246 4.269 5.596 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:2] "Petal.Width" "Petal.Length"
 $ totss       : num 551
 $ withinss    : num [1:3] 13.06 16.29 2.02
 $ tot.withinss: num 31.4
 $ betweenss   : num 520
 $ size        : int [1:3] 52 48 50
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"

Let us plot the points again, now coloring them based on the clusters found by the k-means algorithm:

Using hierarchical clustering

We will perform the same task now using hiercarchical clustering. For this we use the function hclust()

hc_iris

Call:
hclust(d = dis, method = "single")

Cluster method   : single 
Distance         : euclidean 
Number of objects: 150 

Explore the structure of the hc_iris object:

str(hc_iris)
List of 7
 $ merge      : int [1:149, 1:2] -1 -5 -9 -29 -34 -48 -50 -3 -39 -43 ...
 $ height     : num [1:149] 0 0 0 0 0 0 0 0 0 0 ...
 $ order      : int [1:150] 45 25 23 14 44 15 36 24 6 27 ...
 $ labels     : NULL
 $ method     : chr "single"
 $ call       : language hclust(d = dis, method = "single")
 $ dist.method: chr "euclidean"
 - attr(*, "class")= chr "hclust"

To find the cluster assignments in this case, let us cut the dendrogram tree to the proper number of clusters using the cutree() (in our case 3)

And now recreate the plot with the assignment provided by the hierarchical clustering technique

# hierarchical clustering using complete linkage
dis2 <- dist(iris_mini)
# use distance matrix as input for clustering method
hc_iris2 <- hclust(dis2, method = "complete")
# use cutree() to cut the dendrogram
newcut3 <- cutree(hc_iris2, k = 3)
# save the assignment as a factor
hc_results2 <- iris_mini %>%
                mutate(cluster = factor(newcut3))

LS0tCnRpdGxlOiAiSy1tZWFucyBhbmQgSGllcmFyY2hpY2FsIENsdXN0ZXJpbmcgb24gdGhlIGBpcmlzYCBEYXRhc2V0IgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpXZSB3aWxsIGRvIGEgY2x1c3RlcmluZyBhbmFseXNpcyBvbiB0aGUgYGlyaXNgIGRhdGEgc2V0LCB1c2luZyB0aGUgay1tZWFucyB0ZWNobmlxdWUgYW5kIGEgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcgYXBwcm9hY2ggCgpUaGUgaXJpcyBkYXRhIHNldCBpcyBvbmUgb2YgUidzIGJ1aWx0LWluIGRhdGEtc2V0cywgYW5kIGlzIGFsc28gYXZhaWxhYmxlIGF0IHRoZSBbVUNJIE1hY2hpbmUgTGVhcm5pbmcgUmVwb3NpdG9yeV0oaHR0cDovL2FyY2hpdmUuaWNzLnVjaS5lZHUvbWwvKS4KClRoZSBkYXRhIHNldCB3YXMgY3JlYXRlZCBieSBzdGF0aXN0aWNpYW4gUm9uYWxkIEZpc2hlciBpbiBoaXMgMTkzNiBwYXBlciBUaGUgdXNlIG9mIG11bHRpcGxlIG1lYXN1cmVtZW50cyBpbiB0YXhvbm9taWMgcHJvYmxlbXMgYXMgYW4gZXhhbXBsZSBvZiBsaW5lYXIgZGlzY3JpbWluYW50IGFuYWx5c2lzLiAgCgpUaGUgZGF0YSBjb25zaXN0cyBvZiAqKjMgY2xhc3NlcyoqIG9mIGZsb3dlciB0eXBlczoKCiAgKiBzZXRvc2EKICAKICAqIHZpcmdpbmljYQogIAogICogdmVyc2ljb2xvcgogIApUaGUgZGF0YSBoYXMgKio0IGF0dHJpYnV0ZXMqKjoKICAKICAqIHNlcGFsIHdpZHRoCiAgCiAgKiBzZXBhbCBsZW5ndGgKICAKICAqIHBldGFsIHdpZHRoCiAgCiAgKiBwZXRhbCBsZW5ndGgKCgpUaGlzIGZhbW91cyAoRmlzaGVyJ3Mgb3IgQW5kZXJzb24ncykgaXJpcyBkYXRhIHNldCBnaXZlcyB0aGUgbWVhc3VyZW1lbnRzIGluIGNlbnRpbWV0ZXJzIG9mIHRoZSB2YXJpYWJsZXMgc2VwYWwgbGVuZ3RoIGFuZCB3aWR0aCBhbmQgcGV0YWwgbGVuZ3RoIGFuZCB3aWR0aCwgcmVzcGVjdGl2ZWx5LCBmb3IgNTAgZmxvd2VycyBmcm9tIGVhY2ggb2YgMyBzcGVjaWVzIG9mIGlyaXMuIFRoZSBzcGVjaWVzIGFyZSBJcmlzIHNldG9zYSwgdmVyc2ljb2xvciwgYW5kIHZpcmdpbmljYS4KClRoZSBidWlsdC1pbiBkYXRhIHNldCBgaXJpc2AgaXMgYSBkYXRhIGZyYW1lIHdpdGggMTUwIGNhc2VzIChyb3dzKSBhbmQgNSB2YXJpYWJsZXMgKGNvbHVtbnMpIG5hbWVkIGBTZXBhbC5MZW5ndGhgLCBgU2VwYWwuV2lkdGhgLCBgUGV0YWwuTGVuZ3RoYCwgYFBldGFsLldpZHRoYCwgYW5kIGBTcGVjaWVzYC4KCiFbXShodHRwczovL3MzLmFtYXpvbmF3cy5jb20vYXNzZXRzLmRhdGFjYW1wLmNvbS9ibG9nX2Fzc2V0cy9pcmlzLW1hY2hpbmVsZWFybmluZy5wbmcpCgoKYGBge3J9CmxpYnJhcnkodGlkeXZlcnNlKQpgYGAKCgpMZXQgdXMgZmlyc3QgY3JlYXRlIGEgcGxvdCB1c2luZyB0aGUgdmFyaWFibGVzIGBQZXRhbC5XaWR0aGAgYW5kIGBQZXRhbC5XaWR0aGAsIGNvbG9yaW5nIGV2ZXJ5IHBvaW50IGJ5IHRoZWlyIGNsYXNzOgoKYGBge3J9CmdncGxvdChpcmlzLCBhZXMoeCA9IFBldGFsLldpZHRoLCB5ID0gUGV0YWwuTGVuZ3RoLCBjb2xvciA9IFNwZWNpZXMpKSArCiAgZ2VvbV9wb2ludCgpCmBgYAoKCiMjIFVzaW5nIGstbWVhbnMKCkxldCB1cyB0cnkgdG8gZmluZCB0aGUgMyBncm91cHMgdXNpbmcgay1tZWFucyAKCmBgYHtyfQojIHVzZSBrbWVhbnMoKSB3aXRoIHRoZSBzdWJzZXQgb2YgY29sdW1ucyB3ZSB3YW50IHRvIGNvbnNpZGVyCmlyaXNfbWluaSA8LSBpcmlzWywgYygiUGV0YWwuV2lkdGgiLCAiUGV0YWwuTGVuZ3RoIildCmttX2lyaXMgPC0ga21lYW5zKHggPSBpcmlzX21pbmksIGNlbnRlcnMgPSAzKQpgYGAKClRoZSBvYmplY3QgYGttX2lyaXNgIGNvbnRhaW5zIGFuIGF0dHJpYnV0ZSBjYWxsZWQgYGNsdXN0ZXJgIHRoYXQgd2Ugd2FudCB0byBleHRyYWN0IHRvIHNlZSB0aGUgZGlmZmVyZW50IGFzc2lnbm1lbnRzOiAKCmBgYHtyfQpzdHIoa21faXJpcykKYGBgCgpMZXQgdXMgcGxvdCB0aGUgcG9pbnRzIGFnYWluLCBub3cgY29sb3JpbmcgdGhlbSBiYXNlZCBvbiB0aGUgY2x1c3RlcnMgZm91bmQgYnkgdGhlIGstbWVhbnMgYWxnb3JpdGhtOgoKLSBGaXJzdCBjb252ZXJ0IHRoZSBhc3NpZ21lbnQgdG8gYSBfZmFjdG9yXyB2YXJpYWJsZToKCmBgYHtyfQprbV9yZXN1bHRzIDwtIGlyaXNfbWluaSAlPiUKICAgICAgICAgICAgICAgIG11dGF0ZShjbHVzdGVyID0gZmFjdG9yKGttX2lyaXMkY2x1c3RlcikpCmttX3Jlc3VsdHMKYGBgCgotIE5vdyBwbG90IHRoZSBwb2ludHMgY29sb3JlZCBieSB0aGVpciBjbHVzdGVycyBmcm9tIHRoZSBrLW1lYW5zIHByb2NlZHVyZQoKYGBge3J9CiMgay1tZWFucyByZXN1bHRzCmttX3Jlc3VsdHMgJT4lCiAgZ2dwbG90KCkgKwogIGdlb21fcG9pbnQoYWVzKHggPSBQZXRhbC5XaWR0aCwgeSA9IFBldGFsLkxlbmd0aCwgY29sb3IgPSBjbHVzdGVyKSkKYGBgCgoKIyMgVXNpbmcgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcKCgpXZSB3aWxsIHBlcmZvcm0gdGhlIHNhbWUgdGFzayBub3cgdXNpbmcgaGllcmNhcmNoaWNhbCBjbHVzdGVyaW5nLgpGb3IgdGhpcyB3ZSB1c2UgdGhlIGZ1bmN0aW9uIGBoY2x1c3QoKWAgCgpgYGB7cn0KIyBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyB3aXRoIHNpbmdsZSBsaW5rYWdlCmRpcyA8LSBkaXN0KGlyaXNfbWluaSkKIyB1c2UgZGlzdGFuY2UgbWF0cml4IGFzIGlucHV0IGZvciBjbHVzdGVyaW5nIG1ldGhvZApoY19pcmlzIDwtIGhjbHVzdChkaXMsIG1ldGhvZCA9ICJzaW5nbGUiKQpoY19pcmlzCmBgYAoKRXhwbG9yZSB0aGUgc3RydWN0dXJlIG9mIHRoZSBgaGNfaXJpc2Agb2JqZWN0OgoKYGBge3J9CnN0cihoY19pcmlzKQpgYGAKClRvIGZpbmQgdGhlIGNsdXN0ZXIgYXNzaWdubWVudHMgaW4gdGhpcyBjYXNlLCBsZXQgdXMgY3V0IHRoZSBkZW5kcm9ncmFtIHRyZWUgdG8gdGhlIHByb3BlciBudW1iZXIgb2YgY2x1c3RlcnMgdXNpbmcgdGhlIGBjdXRyZWUoKWAgKGluIG91ciBjYXNlIDMpCgpgYGB7cn0KIyB1c2UgY3V0cmVlKCkgdG8gY3V0IHRoZSBkZW5kcm9ncmFtCmN1dDMgPC0gY3V0cmVlKGhjX2lyaXMsIGsgPSAzKQojIHNhdmUgdGhlIGFzc2lnbm1lbnQgYXMgYSBmYWN0b3IKaGNfcmVzdWx0cyA8LSBpcmlzX21pbmkgJT4lCiAgICAgICAgICAgICAgICBtdXRhdGUoY2x1c3RlciA9IGZhY3RvcihjdXQzKSkKaGNfcmVzdWx0cwpgYGAKCkFuZCBub3cgcmVjcmVhdGUgdGhlIHBsb3Qgd2l0aCB0aGUgYXNzaWdubWVudCBwcm92aWRlZCBieSB0aGUgaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcgdGVjaG5pcXVlCgpgYGB7cn0KaGNfcmVzdWx0cyAlPiUKICBnZ3Bsb3QoKSArCiAgZ2VvbV9wb2ludChhZXMoeCA9IFBldGFsLldpZHRoLCB5ID0gUGV0YWwuTGVuZ3RoLCBjb2xvciA9IGNsdXN0ZXIpKQpgYGAKCi0gTWF5YmUgaWYgd2UgdHJ5IGEgZGlmZmVyZW50IF9saW5rYWdlXyB3ZSBjYW4gb2J0YWluIGJldHRlciByZXN1bHRzOiAKCgpgYGB7cn0KIyBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyB1c2luZyBjb21wbGV0ZSBsaW5rYWdlCmRpczIgPC0gZGlzdChpcmlzX21pbmkpCiMgdXNlIGRpc3RhbmNlIG1hdHJpeCBhcyBpbnB1dCBmb3IgY2x1c3RlcmluZyBtZXRob2QKaGNfaXJpczIgPC0gaGNsdXN0KGRpczIsIG1ldGhvZCA9ICJjb21wbGV0ZSIpCmBgYAoKCi0gR2V0IG5ldyBhc3NpZ25tZW50cyBmcm9tIHRoZSBjb21wbGV0ZSBsaW5rYWdlIG9wdGlvbgoKYGBge3J9CiMgdXNlIGN1dHJlZSgpIHRvIGN1dCB0aGUgZGVuZHJvZ3JhbQpuZXdjdXQzIDwtIGN1dHJlZShoY19pcmlzMiwgayA9IDMpCiMgc2F2ZSB0aGUgYXNzaWdubWVudCBhcyBhIGZhY3RvcgpoY19yZXN1bHRzMiA8LSBpcmlzX21pbmkgJT4lCiAgICAgICAgICAgICAgICBtdXRhdGUoY2x1c3RlciA9IGZhY3RvcihuZXdjdXQzKSkKYGBgCgoKLSBBbmQgbm93IHJlY3JlYXRlIHRoZSBwbG90IHdpdGggdGhlIGFzc2lnbm1lbnQgcHJvdmlkZWQgYnkgdGhlIGhpZXJhcmNoaWNhbCBjbHVzdGVyaW5nIHRlY2huaXF1ZQoKYGBge3J9CiMgY3JlYXRlIG5ldyBwbG90CmhjX3Jlc3VsdHMyICU+JQogIGdncGxvdCgpICsKICBnZW9tX3BvaW50KGFlcyh4ID0gUGV0YWwuV2lkdGgsIHkgPSBQZXRhbC5MZW5ndGgsIGNvbG9yID0gY2x1c3RlcikpCmBgYAoKCg==